Skip to content

Add a multi-line string rule#712

Draft
robherring wants to merge 1 commit into
adrienverge:masterfrom
robherring:multi-line-strings
Draft

Add a multi-line string rule#712
robherring wants to merge 1 commit into
adrienverge:masterfrom
robherring:multi-line-strings

Conversation

@robherring

Copy link
Copy Markdown
Contributor

YAML supports 3 styles of multi-line string values: plain, folded, and block. The folded ('>') and block ('|') modifiers define how formatting and some special characters are interpreted. Common problems are using the modifiers when not necessary or not using them when necessary. Add a rule to check for these conditions.

This is a constant source of manual review comments for us (Devicetree schema files in Linux kernel tree) as many users are
completely clueless about multi-line strings in YAML.

I'm looking for feedback on the general approach. Will add more doc examples and tests if the approach is okay.

@coveralls

coveralls commented Mar 3, 2025

Copy link
Copy Markdown

Coverage Status

coverage: 98.738% (-1.1%) from 99.824%
when pulling 521d25f on robherring:multi-line-strings
into e427005 on adrienverge:master.

YAML supports 3 styles of multi-line string values: plain, folded, and
block. The folded ('>') and block ('|') modifiers define how formatting
and some special characters are interpreted. Common problems are using
the modifiers when not necessary or not using them when necessary. Add a
rule to check for these conditions.
@robherring robherring force-pushed the multi-line-strings branch from 8e2fddd to 521d25f Compare March 3, 2025 22:59
@robherring robherring marked this pull request as draft March 6, 2025 21:41
@adrienverge

Copy link
Copy Markdown
Owner

Hello Rob,

The idea is interesting! To understand all the ins and outs, could you give concrete examples of how it would be useful for Device Tree schema files?

In YAML, all the following representations are equivalent (in JSON they would be written as "This is a sentence.\nThis is another sentence.\n\nThis is another paragraph."):

plain flow scalar:
  This is a
  sentence.

  This is another
  sentence.


  This is another
  paragraph.

double-quoted flow scalar:
  "This is a
  sentence.

  This is another
  sentence.


  This is another
  paragraph."

folded block scalar: >-
  This is a
  sentence.

  This is another
  sentence.


  This is another
  paragraph.

literal block scalar: |-
  This is a sentence.
  This is another sentence.

  This is another paragraph.

So if I understand your goal:
- missing-block-token: true could be named forbid-empty-lines-in-flow-scalars: true?
- unnecessary-block-token: true could be named forbid-block-scalars-without-empty-lines: true?

(About option names, I believe it's better to use "forbid-…" or "require-…" for clarity and consistency with other rules. And stay close to the YAML terminology (see flow scalars and block scalars) for precision and future-proofness.)

@robherring

Copy link
Copy Markdown
Contributor Author

Hello Rob,

The idea is interesting! To understand all the ins and outs, could you give concrete examples of how it would be useful for Device Tree schema files?

It's pretty straight-forward as we have cases when to use or not use literal/folded block which are the source of frequent manual review comments. The only place this comes into play are 'description' values as those are the only free form data. None of this really matters until we feed this into something else like some json-schema to docs or a pass thru rtyaml.

These 3 examples are pretty exhaustive for the patterns we have. We don't ever use quotes.

description:
  This is a 
  paragraph.
  This is another
  paragraph.

description:
  This has formatting that
    - needs
    - to
    - be
    - preserved

description: |
  Formatting is not
  important in this case.

So if I understand your goal: - missing-block-token: true could be named forbid-empty-lines-in-flow-scalars: true? - unnecessary-block-token: true could be named forbid-block-scalars-without-empty-lines: true?

I suppose that's better as it encodes exactly what we are checking for. I was a bit worried about corner cases of what's missing or unnecessary. I also thought about making this a regex as essentially we're looking for '\n[\n ]+', but I don't have any use case for other regex's.

There's also the cases of indented lines and special character and character sequences (e.g. ': '), so empty lines doesn't fully capture it. Perhaps forbid-formatted-whitespace-in-flow-scalars?

Another potential check I'm thinking about is line wrapping. Some reason when the rule is wrap at 80 chars, folks wrap at something like 60. Even the line length check doesn't work to enforce <80 because that's set to 110 for exceptions (URLs typically). Maybe we could exclude lines with only leading whitespace. But to fully check wrapping, I think that would have to be rewrap the text at some length and complain if it is different than the input. I bring this up because something like that would catch all the forbid-formatted-whitespace-in-flow-scalars cases.

(About option names, I believe it's better to use "forbid-…" or "require-…" for clarity and consistency with other rules. And stay close to the YAML terminology (see flow scalars and block scalars) for precision and future-proofness.)

FWIW, I was using https://yaml.info which used "block token".

@adrienverge

Copy link
Copy Markdown
Owner

Now that I understand the use case better, I'm questioning myself about whether a yamllint rule would be suited for this.

  • This is about controlling that the content contains / doesn't contain blank lines, whereas yamllint normally only lints the form of YAML. Although here, it's debatable because this would be checked to determine whether the use of block tokens (e.g. > or |) is needed.

  • Forbidding empty lines in flow scalars (= forbidding newline characters in the *resulting string in their content) is very specific to this use case, isn't it?

    There are many cases where it makes sense to have multiline contents, which needs blank lines to be correctly written in a YAML plain flow scalar. For example:

    script:
      GIT_TRACE=1
      git add
      src/dir/dir/dir/dir/dir/up/to/max/allowed/line/length
      --ignore-missing
    
      git
      commit
      --allow-empty
     
      exit
  • Similarly, forbidding block scalars without empty lines inside them is not that common.
    Block scalars without blank lines are often encountered in YAML files:

    script: |
      GIT_TRACE=1 git add src/… --ignore-missing
      git commit --allow-empty
      exit

    In this code from your commit, I haven't understood why you want to forbid the former, but not the latter? Why would the first snippet be more valid?

    #. With ``multi-line-strings: {unnecessary-block-token: true}``
    
       the following code snippet would **PASS**:
       ::
        foo: |
          bar
    
          baz
    
       the following code snippet would **FAIL**:
       ::
        foo: |
          bar
          baz
  • As you just described, what you're trying to enforce is beyond blank lines: it's also about whitespaces and special markers (-, :…)

@robherring

Copy link
Copy Markdown
Contributor Author

Now that I understand the use case better, I'm questioning myself about whether a yamllint rule would be suited for this.

Honestly, I'd rather have a plug-in than have this built-in because I'm not sure there's wider appeal and getting something that's going to work universally is hard. But I saw you pushed back on the plug-in idea. Currently (for me), it doesn't really matter how things are formatted. It's just prose for human consumption. But in the future, I'd like to generate documentation out of it and then how the YAML is parsed and whether the formatting is maintained will matter.

I've looked at coding up something using ruamel, but it's slow in RT mode and doesn't give me the original line breaks in plain flow scalars, so I can't check if flow/block modifier is missing.

* This is about controlling that **the content** contains / doesn't contain blank lines, whereas yamllint normally only lints **the form** of YAML. Although here, it's debatable because this would be checked to determine whether the use of block tokens (e.g. `>` or `|`) is needed.

I'd say it is about checking whether a YAML parser is going to parse the content as intended or reformat it. Certainly there may be cases where the intent isn't clear. My only answer for that is don't turn on this check...

* Forbidding empty lines in flow scalars (= forbidding newline characters in the *resulting string in their content) is very specific to this use case, isn't it?
  There are many cases where it makes sense to have multiline contents, which needs blank lines to be correctly written in a YAML plain flow scalar. For example:
  ```yaml
  script:
    GIT_TRACE=1
    git add
    src/dir/dir/dir/dir/dir/up/to/max/allowed/line/length
    --ignore-missing
  
    git
    commit
    --allow-empty
   
    exit
  ```

While this works, I wouldn't recommend anyone use it because many users don't know YAML well enough to know this and below are equivalent. Even if you do know, it is easy enough to forget to add a '|'. For something like this (a script), I would want to require it to be a block scalar and disallow the others. So maybe this needs to be only for specific keys. That would work for me as it is only 'description' that I need to check. But I guess something applied per key would be a bit of a departure from normal yamllint rules.

* Similarly, forbidding block scalars without empty lines inside them is not that common.
  Block scalars without blank lines are often encountered in YAML files:
  ```yaml
  script: |
    GIT_TRACE=1 git add src/… --ignore-missing
    git commit --allow-empty
    exit
  ```
      
    
  In this code from your commit, I haven't understood why you want to forbid the former, but not the latter? Why would the first snippet be more valid?

Huh? I want to forbid the latter (the FAIL one).

I've run this thru our files. The only kind of false-positive has been something like the latter where the intent was clearly to be a list. So that would require some rework to make lists clearly a list. Given it's just prose, not a big deal for us.

  ```
  #. With ``multi-line-strings: {unnecessary-block-token: true}``
  
     the following code snippet would **PASS**:
     ::
      foo: |
        bar
  
        baz
  
     the following code snippet would **FAIL**:
     ::
      foo: |
        bar
        baz
  ```

* As you just described, what you're trying to enforce is beyond blank lines: it's also about whitespaces and special markers (`-`, `:`…)

Yes, as any of those require either quoting (we also minimize quoting) or a block/flow modifier. At least a ':' will trigger a parsing error, but a '# ' will just silently drop the content when parsed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants